```mermaid
graph LR
A["Existing Doc Benchmarks<br/>Limited doc types<br/>Simplified evaluation"] --> B["Unfair comparisons<br/>between models"]
B --> C["OmniDocBench 1.5<br/>1,355 pages · 9 doc types<br/>Multi-level evaluation"]
C --> D["Fair, fine-grained<br/>document parsing<br/>evaluation"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
OmniDocBench 1.5
A comprehensive benchmark for evaluating diverse PDF document parsing — covering text OCR, table recognition, formula extraction, and layout detection across 1,355 real-world pages
Keywords: OmniDocBench, document parsing benchmark, PDF parsing, OCR evaluation, table recognition, formula recognition, layout detection, CVPR 2025, OpenDataLab, Shanghai AI Laboratory, document understanding, VLM evaluation, pipeline evaluation, TEDS, CDM, edit distance

Introduction
Large language models and RAG systems are only as good as the documents they can parse. Yet the field has lacked a fair, comprehensive benchmark for how accurately AI can extract text, tables, formulas, and layout from real-world PDFs: academic papers, financial reports, handwritten notes, newspapers.
OmniDocBench fills this gap. It is a rigorously annotated benchmark spanning 1,355 PDF pages across 9 document types, 4 layout types, and 3 languages, with over 20,000 block-level and 80,000 span-level annotations. Version 1.5 (September 2025) expanded the dataset with 374 new pages, balanced Chinese/English coverage, and introduced an improved evaluation methodology.
“Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks.” — OmniDocBench Paper
What Is OmniDocBench?
OmniDocBench is a benchmark for evaluating diverse PDF document parsing in real-world scenarios. It assesses how well AI systems can convert complex PDF pages into structured, machine-readable output (typically Markdown), covering text extraction, table recognition, formula parsing, layout detection, and reading order.
Key Characteristics
| Feature | Details |
|---|---|
| Total pages | 1,355 PDF pages (v1.5) |
| Document types | 9 — academic papers, textbooks, financial reports, newspapers, handwritten notes, PPTs, magazines, test papers, books |
| Layout types | 4 — single-column, double-column, three-column, complex |
| Languages | 3 — English, Chinese, mixed |
| Block-level annotations | 15 categories (text paragraphs, headings, tables, etc.) — over 20,000 |
| Span-level annotations | 4 categories (text lines, inline formulas, subscripts, etc.) — over 80,000 |
| Table annotations | Both LaTeX and HTML formats |
| Formula annotations | LaTeX format with language attributes |
| Reading order | Full reading-order annotations for document components |
| Attribute labels | 5 page-level + 3 text-level + 6 table-level attribute tags |
| License | Apache-2.0 |
| Accepted at | CVPR 2025 |
What Makes It Comprehensive?
Unlike narrow benchmarks that focus on a single document type or a single extraction task, OmniDocBench evaluates five distinct capabilities across diverse, real-world documents:
```mermaid
graph TD
ODB["OmniDocBench 1.5<br/>1,355 PDF pages"] --> E2E["End-to-End<br/>Document Parsing"]
ODB --> OCR["Text OCR<br/>Recognition"]
ODB --> TAB["Table<br/>Recognition"]
ODB --> FORM["Formula<br/>Recognition"]
ODB --> LAY["Layout<br/>Detection"]
E2E --> M1["Edit Distance · BLEU<br/>METEOR · TEDS · CDM"]
OCR --> M2["Normalized<br/>Edit Distance"]
TAB --> M3["TEDS<br/>(Tree Edit Distance)"]
FORM --> M4["CDM<br/>(Character Detection Matching)"]
LAY --> M5["COCODet<br/>(mAP, mAR)"]
style ODB fill:#e74c3c,color:#fff,stroke:#333
style E2E fill:#3498db,color:#fff,stroke:#333
style OCR fill:#27ae60,color:#fff,stroke:#333
style TAB fill:#f39c12,color:#fff,stroke:#333
style FORM fill:#8e44ad,color:#fff,stroke:#333
style LAY fill:#e67e22,color:#fff,stroke:#333
style M1 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M2 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M3 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M4 fill:#ecf0f1,color:#333,stroke:#bdc3c7
style M5 fill:#ecf0f1,color:#333,stroke:#bdc3c7
```
Version 1.5 Updates (September 2025)
OmniDocBench v1.5 introduced several important improvements over v1.0:
- +374 new pages — balanced Chinese/English page counts and increased formula-rich pages
- Higher resolution — newspaper and note images upgraded from 72 DPI to 200 DPI
- Improved matching algorithm — formulas and text can now be matched with each other, reducing score errors from Unicode formula outputs
- Simplified Overall metric — now calculated as: $\text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}$
- Language attributes for formulas — 68 Chinese + 982 English formulas
- Inline formulas increased from 353 to 1,050
Who Built It?
OmniDocBench was developed by researchers at OpenDataLab, Shanghai Artificial Intelligence Laboratory (Shanghai AI Laboratory). The authors are:
- Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
The project was published at CVPR 2025 (IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the top-tier venues in computer vision.
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2412.07626 |
| GitHub | github.com/opendatalab/OmniDocBench |
| Official site | opendatalab.com/omnidocbench |
What Skills Does It Test?
OmniDocBench evaluates the full spectrum of document understanding capabilities:
| Capability | What It Tests | Metric |
|---|---|---|
| End-to-end parsing | Full-page PDF-to-Markdown conversion — text, tables, formulas, reading order combined | Overall (composite), Edit Distance |
| Text OCR | Accurate recognition of text paragraphs across languages, fonts, and layouts | Normalized Edit Distance |
| Table recognition | Structural and content extraction of tables (simple, complex, merged cells) | TEDS (Tree Edit Distance Similarity) |
| Formula recognition | Correct LaTeX transcription of display and inline formulas | CDM (Character Detection Matching) |
| Layout detection | Localization and classification of document components (text, tables, figures, etc.) | COCODet metrics (mAP, mAR) |
| Reading order | Correct sequencing of document elements for downstream processing | Edit Distance |
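Several of these metrics reduce to character-level edit distance. As a minimal sketch (normalizing by the longer string's length, a common convention — the benchmark's exact normalization may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means a perfect match."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

Because the score is normalized, a one-character slip in a long paragraph costs far less than the same slip in a short heading, which is why per-category breakdowns matter.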
Three Categories of Models Evaluated
OmniDocBench evaluates three fundamentally different approaches to document parsing:
```mermaid
graph LR
A["Specialized VLMs<br/>PaddleOCR-VL, MinerU,<br/>MonkeyOCR, Dolphin"] --> D["End-to-End<br/>Leaderboard"]
B["General VLMs<br/>Qwen3-VL, Gemini,<br/>GPT-4o, InternVL"] --> D
C["Pipeline Tools<br/>PP-StructureV3, Marker,<br/>MinerU-pipeline"] --> D
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#8e44ad,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
```
Current Leaderboard
The leaderboard below shows the end-to-end document parsing results on OmniDocBench v1.5. The Overall score is the composite metric: $\frac{(1 - \text{Edit Dist}) \times 100 + \text{TEDS} + \text{CDM}}{3}$. Higher Overall is better; lower Edit Distance is better.
Source: OmniDocBench GitHub Repository (consulted March 29, 2026). Dataset version 1.5 (September 2025).
Specialized Document VLMs
| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|---|
| 1 | PaddleOCR-VL | 0.9B | 92.86 | 0.035 | 90.89 | 94.76 |
| 2 | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.22 | 92.38 |
| 3 | OpenDoc-0.1B | 0.1B | 90.49 | 0.043 | 88.05 | 91.97 |
| 4 | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 86.78 | 90.63 |
| 5 | OCRVerse | 4B | 88.56 | 0.058 | 84.55 | 88.45 |
| 6 | dots.ocr | 3B | 88.41 | 0.048 | 86.78 | 90.62 |
| 7 | MonkeyOCR-3B | 3B | 87.13 | 0.075 | 81.39 | 85.92 |
| 8 | Deepseek-OCR | 3B | 87.01 | 0.073 | 84.97 | 88.80 |
| 9 | MonkeyOCR-pro-1.2B | 1.2B | 86.96 | 0.084 | 84.24 | 89.02 |
| 10 | Nanonets-OCR-s | 3B | 85.59 | 0.093 | 80.14 | 85.57 |
| 11 | MinerU2-VLM | 0.9B | 85.56 | 0.078 | 83.54 | 87.66 |
| 12 | Dolphin-1.5 | 0.3B | 83.21 | 0.092 | 78.06 | 84.10 |
| 13 | olmOCR | 7B | 81.79 | 0.096 | 68.92 | 74.77 |
| 14 | POINTS-Reader | 3B | 80.98 | 0.134 | 77.13 | 81.66 |
| 15 | Mistral OCR | — | 78.83 | 0.164 | 70.03 | 78.04 |
| 16 | OCRFlux | 3B | 74.82 | 0.193 | 75.75 | 80.23 |
| 17 | Dolphin | 0.3B | 74.67 | 0.125 | 68.70 | 77.77 |
General Vision-Language Models
| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|---|
| 1 | Qwen3-VL-235B | 235B | 89.15 | 0.069 | 86.21 | 90.55 |
| 2 | Gemini-2.5 Pro | — | 88.03 | 0.075 | 85.71 | 90.29 |
| 3 | Qwen2.5-VL | 72B | 87.02 | 0.094 | 82.15 | 86.22 |
| 4 | InternVL3.5 | 241B | 82.67 | 0.142 | 75.00 | 81.28 |
| 5 | InternVL3 | 78B | 80.33 | 0.131 | 70.64 | 77.74 |
| 6 | GPT-4o | — | 75.02 | 0.217 | 67.07 | 76.09 |
Pipeline Tools
| Rank | Model | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
|---|---|---|---|---|---|
| 1 | PP-StructureV3 | 86.73 | 0.073 | 81.68 | 89.48 |
| 2 | MinerU2-pipeline | 75.51 | 0.209 | 70.90 | 79.11 |
| 3 | Marker-1.8.2 | 71.30 | 0.206 | 57.88 | 71.17 |
Key takeaways:
- PaddleOCR-VL (0.9B) leads overall at 92.86, showing that compact specialized models can outperform massive general VLMs on document parsing
- Among general VLMs, Qwen3-VL-235B (89.15) and Gemini-2.5 Pro (88.03) compete closely with specialized models
- GPT-4o scores only 75.02 — significantly behind purpose-built document parsers
- Tiny models like OpenDoc-0.1B (90.49) and Dolphin-1.5 (0.3B, 83.21) demonstrate impressive efficiency for their size
Where to Explore the Benchmark
Dashboards and Resources
| Resource | Description | Link |
|---|---|---|
| Official Site | OpenDataLab’s OmniDocBench leaderboard and dataset portal | opendatalab.com/omnidocbench |
| GitHub Repository | Evaluation code, configs, inference scripts, and result tables | github.com/opendatalab/OmniDocBench |
| Hugging Face Dataset | Download the 1,355-page annotated dataset (1.25 GB) | huggingface.co/datasets/opendatalab/OmniDocBench |
| OpenDataLab Dataset | Alternative dataset download | opendatalab.com/OpenDataLab/OmniDocBench |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2412.07626 |
Load the Dataset
```python
from datasets import load_dataset

dataset = load_dataset("opendatalab/OmniDocBench", split="train")
print(f"Total pages: {len(dataset)}")
# Total pages: 1358
```
Run the Evaluation
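Each page record carries block-level and span-level annotations. The snippet below tallies block categories on a mock record; the field names (`layout_dets`, `category_type`) are illustrative assumptions about the schema, not the verified dataset format — check the repo's docs for the real layout:

```python
from collections import Counter

# Mock page record -- the keys here are assumptions for illustration only.
page = {
    "layout_dets": [
        {"category_type": "text_block", "poly": [60, 80, 520, 80, 520, 140, 60, 140]},
        {"category_type": "table", "poly": [60, 160, 520, 160, 520, 400, 60, 400]},
        {"category_type": "equation_isolated", "poly": [100, 420, 480, 420, 480, 460, 100, 460]},
        {"category_type": "text_block", "poly": [60, 480, 520, 480, 520, 560, 60, 560]},
    ],
}

# Count how many blocks of each category appear on the page.
counts = Counter(block["category_type"] for block in page["layout_dets"])
print(counts.most_common())
# [('text_block', 2), ('table', 1), ('equation_isolated', 1)]
```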
```shell
# Setup
conda create -n omnidocbench python=3.10
conda activate omnidocbench
pip install -r requirements.txt

# Run evaluation with your model's markdown output
python pdf_validation.py --config configs/end2end.yaml
```
The evaluation framework supports flexible configuration files for each task: end2end, md2md, table recognition, formula recognition, OCR, and layout detection.
Understanding the Metrics
Overall Score
The primary end-to-end metric combines three component scores:
$$\text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}$$
This gives equal weight to text, table, and formula extraction quality.
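The composite is a direct average once the text term is rescaled to 0–100. A minimal sketch, with made-up component values rather than leaderboard numbers:

```python
def overall_score(text_edit_distance: float, table_teds: float, formula_cdm: float) -> float:
    """v1.5 composite: equal weights for text, tables, and formulas.
    text_edit_distance is in [0, 1]; TEDS and CDM are in [0, 100]."""
    return ((1 - text_edit_distance) * 100 + table_teds + formula_cdm) / 3

# Hypothetical component scores, not taken from the leaderboard:
print(round(overall_score(0.05, 85.0, 90.0), 2))  # (95 + 85 + 90) / 3 = 90.0
```

Note that a text edit distance of 0.05 already costs 5 points on the text term, so strong table and formula scores cannot fully mask weak OCR.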
Component Metrics
| Metric | Range | What It Measures | Used For |
|---|---|---|---|
| Edit Distance | 0–1 ↓ | Character-level differences between predicted and ground-truth text | Text OCR, reading order |
| TEDS | 0–100 ↑ | Tree Edit Distance Similarity — structural + content accuracy of tables | Table recognition |
| CDM | 0–100 ↑ | Character Detection Matching — precision of formula LaTeX transcription | Formula recognition |
| BLEU / METEOR | 0–1 ↑ | Standard NLP metrics for text similarity | Alternative text quality measures |
| mAP / mAR | 0–1 ↑ | COCO detection metrics for bounding box localization | Layout detection |
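Real TEDS computes an edit distance over the full HTML table tree, including cell spans. As a much rougher stand-in that still conveys the structure-plus-content idea, the sketch below parses tables with the stdlib `html.parser` and compares `(row, text)` cell sequences with `difflib` — it is not the TEDS algorithm:

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collect (row_index, cell_text) pairs so both content and coarse
    row structure enter the comparison."""
    def __init__(self):
        super().__init__()
        self.row = -1
        self.cells = []
        self._in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
        elif tag in ("td", "th"):
            self._in_cell = True
            self.cells.append((self.row, ""))
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell:
            r, text = self.cells[-1]
            self.cells[-1] = (r, text + data.strip())

def rough_table_similarity(pred_html: str, gt_html: str) -> float:
    """Similarity in [0, 1] over (row, text) cell sequences. Not TEDS:
    real TEDS edits the HTML tree and handles rowspan/colspan."""
    def cells(html):
        p = CellExtractor()
        p.feed(html)
        return p.cells
    return SequenceMatcher(None, cells(pred_html), cells(gt_html)).ratio()

gt = "<table><tr><th>Year</th><th>Revenue</th></tr><tr><td>2024</td><td>1.2M</td></tr></table>"
pred = "<table><tr><th>Year</th><th>Revenue</th></tr><tr><td>2024</td><td>1.5M</td></tr></table>"
print(rough_table_similarity(pred, gt))  # 0.75: 3 of 4 cells match
```

A single wrong cell in a 2×2 table costs a quarter of the score here; TEDS behaves analogously but at the granularity of tree nodes.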
Document Types and Attribute Analysis
OmniDocBench goes beyond aggregate scores by providing fine-grained, attribute-level results. You can break down performance by:
- Document type — academic paper, textbook, financial report, newspaper, handwritten note, PPT, magazine, test paper, book
- Layout complexity — single-column, double-column, three-column, complex
- Language — Chinese, English, mixed
- Table attributes — simple vs. complex, with/without merged cells, colored backgrounds
- Text attributes — font size, orientation, special characters
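Attribute-level analysis amounts to grouping per-page scores by their tags. The records below are fabricated for illustration (real runs emit results in the repo's own format); only the grouping logic matters:

```python
from collections import defaultdict

# Fabricated per-page results, tagged with a document-type attribute.
results = [
    {"page_type": "academic_paper", "edit_distance": 0.03},
    {"page_type": "academic_paper", "edit_distance": 0.05},
    {"page_type": "newspaper", "edit_distance": 0.18},
    {"page_type": "handwritten_note", "edit_distance": 0.27},
    {"page_type": "newspaper", "edit_distance": 0.22},
]

# Bucket scores by attribute, then report the mean per bucket.
by_type = defaultdict(list)
for r in results:
    by_type[r["page_type"]].append(r["edit_distance"])

for doc_type, scores in sorted(by_type.items()):
    print(f"{doc_type:17s} mean edit distance = {sum(scores) / len(scores):.3f}")
# academic_paper    mean edit distance = 0.040
# handwritten_note  mean edit distance = 0.270
# newspaper         mean edit distance = 0.200
```

This is exactly the kind of breakdown that exposes a model which aces clean academic papers but collapses on handwritten notes.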
Why OmniDocBench Matters
```mermaid
graph LR
A["LLMs & RAG need<br/>accurate document<br/>parsing"] --> B["Existing benchmarks<br/>too narrow"]
B --> C["OmniDocBench<br/>fills the gap"]
C --> D["Better document AI<br/>for real-world use"]
A2["Models compared<br/>unfairly"] --> B2["Different eval<br/>methodologies"]
B2 --> C
C --> D2["Standardized,<br/>reproducible<br/>benchmarking"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Diverse and realistic — 9 document types covering the full range of real-world PDFs, not just academic papers
- Multi-level evaluation — end-to-end, task-specific, and attribute-level analysis to pinpoint model weaknesses
- Fair comparison — standardized evaluation code ensures reproducible, apples-to-apples comparisons
- Covers the full pipeline — text, tables, formulas, layout, and reading order in a single benchmark
- Active community — 1.6k GitHub stars, 14 contributors, regular model additions (Docker support added November 2025)
Conclusion
OmniDocBench 1.5 sets the standard for document parsing evaluation:
- 1,355 PDF pages across 9 document types, 4 layouts, and 3 languages — far broader than any predecessor
- 100,000+ annotations at block and span levels, with reading order and multi-format table/formula ground truth
- Five evaluation dimensions — end-to-end, text OCR, table, formula, and layout detection
- The best specialized model (PaddleOCR-VL) achieves 92.86 Overall — but general VLMs like GPT-4o still score only 75.02, revealing a significant gap
- Accepted at CVPR 2025 and actively maintained with regular model updates
As document AI becomes critical infrastructure for LLMs and RAG systems, OmniDocBench provides the rigorous, multi-dimensional evaluation needed to drive real progress — not just on cherry-picked academic papers, but across the messy diversity of real-world documents.
References
- Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C. “OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations.” CVPR 2025. arxiv.org/abs/2412.07626
- OpenDataLab. “OmniDocBench Dataset.” Hugging Face. huggingface.co/datasets/opendatalab/OmniDocBench
- OpenDataLab. “OmniDocBench GitHub Repository.” github.com/opendatalab/OmniDocBench
- OpenDataLab. “OmniDocBench Official Site.” opendatalab.com/omnidocbench
Read More
- Explore how models handle multimodal understanding beyond documents — see MMMU-Pro
- Evaluate LLMs on scientific figure comprehension — see CharXiv Reasoning
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- OmniDocBench GitHub Repository
- OmniDocBench Dataset on Hugging Face
- OmniDocBench Official Site